Record: Train Larger, Quantize Harder - 33.6M params + int5 GPTQ / (val_bpb: 1.1164) #576
Closed
cmcdnd wants to merge 1 commit into openai:main from
Conversation
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 24, 2026
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After main eval, sweeps T={0.95,0.96,0.97,0.98,0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi added a commit to sunnypatneedi/parameter-golf that referenced this pull request on Mar 24, 2026
Builds on Run 2. Changes from PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31→15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb).

8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
It looks to me like this code uses training data at eval time due to the post-pruning calibration scheme, so I think this submission is invalid.
nishant-resolve-ai pushed a commit to nishant-resolve-ai/parameter-golf that referenced this pull request on Mar 24, 2026
- Autoresearch loop (program.md, loop.sh, generate_next.py)
- Modal provider for 8xH100 training with checkpoint save/restore
- Experiment framework with preflight size checks
- eval_ttt.py for TTT evaluation against saved checkpoints
- train_gpt_improved.py: PR openai#569 base (VRL, GPTQ, LeakyReLU², pruning)
- train_gpt_576.py: PR openai#576 base (int5, 33.6M params, score-first TTT)
- train_gpt_sota.py: PR openai#573 base
- train_gpt_mlx_recurrent.py: depth recurrence experiments
- Benchmark scripts for local MLX A/B testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request on Mar 25, 2026
…ensation

Implement GPTQ (Hessian-aware) quantization for int5 (31 levels, clip=15). Uses Cholesky-based error redistribution across columns for minimal quant damage. Calibrates on 256 training sequences. Enables fitting 12L+ models within the 16MB artifact limit. Controlled by GPTQ_ENABLED=1 (default: off).

Based on PR openai#576's technique (1.1162 BPB with 33.6M int5 params).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request on Mar 25, 2026
12L models:
- int6: 1.1139 BPB (17.56MB, over limit)
- int5 GPTQ: 1.1254 BPB (14.24MB, fits but +0.011 damage)
- int5 GPTQ aligned QAT: 1.1254 BPB (same, alignment didn't help)
- No bigram: 1.1153 BPB (16.53MB, still over)

11L int6 GPTQ: 1.1293 BPB (GPTQ hurts int6)

Key finding: int5 quantization damage is ~+0.012 BPB even with GPTQ. Need PR openai#576's Soft-Round QAT (tanh-based) for better alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request on Mar 28, 2026
Train larger (33.6M params, d=576, MLP 3.5x), quantize harder (int5 GPTQ). Legal score-first TTT (AdamW, cosine LR, 3 epochs) + post-TTT temperature calibration (T=0.98). 3-seed mean 1.1145 BPB (std 0.0003).

Based on PR openai#576.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ibarrajo added a commit to ibarrajo/parameter-golf that referenced this pull request on Mar 28, 2026
Approach A (openai#569 int5, no TTT): 1.1317 — int5 penalty too high on d=512
Approach B (openai#576 d=576 int5 + legal s_0 TTT): 1.1188 — best legal result
Approach C (GEPA int5 + TTT): artifact over 16MB

Key lesson: TTT re-scoring is illegal (PR openai#991 closed for this). Only the s_0 cumulative first-pass score is legal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 28, 2026
Train Larger, Quantize Harder: 33.6M params quantized to int5 with full Hessian GPTQ, fitting in 15.6MB. Adds post-TTT temperature calibration (T=0.98), which corrects TTT-induced overconfidence for an additional -0.003 BPB, a technique not used in prior submissions.
Builds on my int5 QAT approach from PR #469 (first one with more params) and the 33.6M architecture from PR #545.
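The core of GPTQ is greedy quantize-and-compensate: quantize one weight column at a time and push its rounding error onto the not-yet-quantized columns, weighted by the inverse calibration Hessian. A simplified NumPy sketch, assuming H = XᵀX from a calibration batch; the submission itself uses the Cholesky formulation, and `gptq_quantize` is a name chosen here for illustration:

```python
import numpy as np

def gptq_quantize(W, H, clip_range=15):
    """Greedy column-wise GPTQ sketch (illustrative, not the submission's code):
    quantize each column to int5 levels, then spread that column's
    quantization error across the remaining columns via the inverse Hessian."""
    W = np.asarray(W, dtype=np.float64).copy()
    d = W.shape[1]
    # Damped inverse of the calibration Hessian H = X^T X.
    Hinv = np.linalg.inv(H + 1e-4 * np.trace(H) / d * np.eye(d))
    # One symmetric scale per output row (int5: 31 levels in [-15, 15]).
    scale = np.abs(W).max(axis=1, keepdims=True) / clip_range
    scale[scale == 0] = 1.0          # guard all-zero rows
    Q = np.zeros_like(W)
    for j in range(d):
        q = np.clip(np.round(W[:, j:j+1] / scale), -clip_range, clip_range)
        Q[:, j] = q[:, 0]
        # Per-row residual, normalized by the inverse-Hessian diagonal.
        err = (W[:, j:j+1] - q * scale) / Hinv[j, j]
        # Compensate the columns that have not been quantized yet.
        W[:, j+1:] -= err @ Hinv[j:j+1, j+1:]
    return Q.astype(np.int8), scale
```

Round-to-nearest is the degenerate case of this loop with the compensation line removed, which is why GPTQ's extra cost buys lower Hessian-weighted reconstruction error.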
Architecture: 11L, 512d, MHA 8/8, MLP 3.5x (1792), BigramHash 8192, XSA all layers, LeakyReLU², VE128
Quantization: Int5 per-row GPTQ (clip_range=15) + Early QAT (threshold 0.5) + EMA 0.997 + 2% pruning
Eval: Score-first TTT at T=1.0 (AdamW, lr=1e-4, chunk=131K, last 2 blocks) → re-score at T=0.98
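The QAT step in the quantization recipe above replaces the row-max clip with a quantile(0.9995) clip, so a handful of outlier weights do not inflate the per-row scale. A forward-pass-only NumPy sketch under that assumption; the straight-through gradient, EMA 0.997 clip smoothing, and 2% pruning are omitted, and `fake_quant_int5` is an illustrative name:

```python
import numpy as np

def fake_quant_int5(w, clip_range=15, quantile=0.9995):
    """QAT-style fake quantization, forward pass only (illustrative sketch):
    clip each row at the 0.9995 quantile of |w| rather than the row max,
    round to int5 levels, and dequantize back to float."""
    clip = np.quantile(np.abs(w), quantile, axis=1, keepdims=True)
    clip[clip == 0] = 1.0            # guard all-zero rows
    scale = clip / clip_range
    q = np.clip(np.round(w / scale), -clip_range, clip_range)
    return q * scale                 # weights the forward pass actually sees
```

Training against these fake-quantized weights is what aligns the float model with the int5 grid it will be exported to.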
Score-first TTT systematically makes the model overconfident. A fixed temperature T=0.98 on post-TTT logits recovers ~0.003 BPB at zero cost.
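The calibration amounts to dividing the post-TTT logits by a fixed T before the softmax and keeping whichever T in the sweep scores best. A minimal NumPy sketch with random stand-in logits (the function name is illustrative, not the submission's API):

```python
import numpy as np

def bpb_at_temperature(logits, targets, T=1.0):
    """Bits-per-byte of targets under temperature-scaled logits.
    Dividing logits by T rescales the model's confidence before scoring."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)              # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll_nats = -logp[np.arange(len(targets)), targets].mean()
    return nll_nats / np.log(2)                        # nats -> bits

# Sweep the same grid as above; T=1.0 is the uncalibrated baseline.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1024, 256))                  # stand-in for model output
targets = rng.integers(0, 256, size=1024)
base = bpb_at_temperature(logits, targets, T=1.0)
best = min(bpb_at_temperature(logits, targets, T=t)
           for t in (0.95, 0.96, 0.97, 0.98, 0.99, 1.0))
```

Because scaling logits touches no weights and reuses the cached logits, the sweep costs essentially nothing at eval time.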
Results
SOTA improvement: 1.8958 - 1.8861 = 0.0097 nats (threshold: 0.005, p << 0.01)
Reproduction
Eval Time Budget (~380s, under 10 min limit)